Abstract



In computerized adaptive testing, updating item parameter estimates using adaptive testing 
data is often called on-line calibration. In this paper, it is investigated how to evaluate 
whether the adaptive testing data used for on-line calibration sufficiently fit the item 
response model used. Three approaches are investigated, based on a Lagrange multiplier 
(LM) statistic, a Wald statistic and a cumulative sum (CUSUM) statistic. The power of the 
tests is evaluated with a number of simulation studies. 

Key words: Computerized Adaptive Testing, CUSUM-chart, Item Response Theory, 
Lagrange Multiplier Test, Model Fit, Modification Indices, On-line Calibration, Rao's
Efficient Score Test, 2-Parameter Logistic Model, 3-Parameter Logistic Model.
Introduction

Computerized assessment, such as CBT (computer based testing) and CAT 
(computer adaptive testing), is based on the availability of a large pool of calibrated test 
items. Usually, the calibration process consists of two stages. 

(1) The pre-testing stage. In this stage, subsets of items are administered to subsets of 
respondents in a series of pre-test sessions, and an item response (IRT) model is fitted 
to the data to obtain item parameter estimates to support computerized test 
administration. 

(2) The on-line stage. In this stage, data are gathered in a computerized assessment 
environment. There may be several motives for using these data for further parameter 
estimation. The interest may be to continuously update estimates to attain the greatest 
possible precision. Or new, previously un-calibrated items may be entered into the 
bank and can only be calibrated using incoming responses. 

Closely related to the motives for on-line calibration, but also an aim in itself, is quality 
control, that is, checking whether pre-test and on-line results comply with the same IRT 
model. In the present paper, three methods of quality control are proposed. The first 
method is based on the Lagrange multiplier statistic. The method can be viewed as a 
generalization to adaptive testing of the modification indices for the 2-PL model and the 
nominal response model introduced by Glas (1997a, 1997b). The second method is based 
on a Wald statistic. The third method is based on a so-called cumulative sum (CUSUM) 
statistic. This last approach stems from the field of statistical quality control (see, for 
instance, Wetherill, 1977). Using this method in the framework of IRT-based adaptive 
testing was first suggested by Veerkamp (1996) in the framework of the Rasch model. In 
this paper, the procedure will be generalized to the 3-PL model. 

This paper is organized as follows. In Section 2, a framework for estimation of the
2-PL model will be outlined, which will subsequently be used for a general introduction of
the LM statistic in Section 3. Then, in Sections 4 and 5, the LM statistic and the Wald and
CUSUM statistics will be applied to quality control in adaptive testing. In Section 6, the
performance of the proposed methods will be evaluated with a number of simulation 
studies. Finally, in Section 7 some conclusions and suggestions for further research will be 
formulated. 

Before proceeding, a remark should be made with respect to the scope of this
paper. Strictly speaking, the methods proposed here also apply to a situation where there is
no pre-test stage and the item bank is bootstrapped during the on-line stage. However,
without a pre-test stage, in the initial stages of on-line calibration, the data on some of the
items may be prohibitively scarce or even ill-conditioned, in the sense that there is too 
little information in the data to estimate all relevant parameters. Below, it will be assumed 
that the data are such that parameter estimates can be obtained. Generalization of the
methods to be proposed to poorly conditioned data, probably by introducing prior
distributions on the item parameters, is beyond the scope of the present paper and will be
treated later. Further, it will be assumed that the number of items in the bank is such that
standard errors of estimates can be computed using the complete information matrix.
Application of the procedures to very large item banks, where other approximations to the
standard errors have to be made, is also a point of future research.

Preliminaries 

Consider dichotomous items where responses of persons labeled $n$ to items labeled $i$
are coded $x_{ni} = 0$ and $x_{ni} = 1$. The probability of a correct response is given by

$$\phi_i(\theta_n) = \Pr(X_{ni} = 1 \mid \theta_n, \alpha_i, \beta_i, \gamma_i)
  = \gamma_i + (1 - \gamma_i)\,\psi_i(\theta_n)
  = \gamma_i + (1 - \gamma_i)\,\frac{\exp(\alpha_i\theta_n - \beta_i)}{1 + \exp(\alpha_i\theta_n - \beta_i)}, \qquad (1)$$

where $\theta_n$ is the ability parameter of person $n$ and $\alpha_i$, $\beta_i$ and $\gamma_i$ are the
discrimination, difficulty and guessing parameter of item $i$, respectively. Since
simultaneous ML estimates of all item parameters are hard to obtain (see, for instance,
Swaminathan and Gifford, 1986), in the present paper it will be assumed that $\gamma_i$ is fixed
to some plausible constant, say, to the guessing probability. Using priors on $\gamma_i$ to
facilitate its estimation is a topic for future study. Below, the well-known theory of MML
estimation for IRT models will be re-iterated. In this presentation the formalism of Glas
(1992, 1997a, 1997b) will be used, which, as will become apparent in the sequel, is
especially suited for the introduction of the procedures below. The choice of a distribution
of ability is not essential to the theory presented here; it can be the parametric MML
framework (see Bock & Aitkin, 1982) or the non-parametric MML framework (see De Leeuw
& Verhelst, 1986, Follmann, 1988). However, to make the presentation explicit, it is
assumed that the ability distribution is normal with parameters $\mu$ and $\sigma$. Further, for
reasons of simplicity, it is assumed that all respondents belong to the same population.
Modern software for the 2- and 3-PL model, such as Bilog-MG (Zimowski, Muraki, Mislevy,
& Bock, 1996), does not have this restriction, but this generalization is straightforward.
So, let $g(\theta_n; \mu, \sigma)$ be the density of $\theta_n$. Further, let the item administration
variable $d_{ni}$ take the value one if the item was administered to $n$ and zero if this was
not the case. If $d_{ni} = 0$ it will be assumed that $x_{ni} = c$, where $c$ is some arbitrary
constant.
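As a small illustration of equation (1), the response probability can be sketched in a few lines of Python. The function name `p_3pl` and the item parameter values in the loop are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def p_3pl(theta, alpha, beta, gamma):
    """Probability of a correct response under the 3-PL model, equation (1)."""
    # 2-PL part psi_i(theta) with linear predictor alpha * theta - beta
    psi = 1.0 / (1.0 + math.exp(-(alpha * theta - beta)))
    # guessing parameter gamma lifts the lower asymptote of the curve
    return gamma + (1.0 - gamma) * psi

# Hypothetical item: alpha = 1.2, beta = 0.5, gamma = 0.20
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, 1.2, 0.5, 0.20), 3))
```

For very low abilities the probability approaches the guessing parameter, and for very high abilities it approaches one.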

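Let $x_n$ and $d_n$ be the response pattern and the item administration vector of
respondent $n$, respectively. With a reference to the ignorability principle by Rubin (1976),
Mislevy (1986) asserts that in adaptive testing consistent ML estimates of the model
parameters can be obtained by maximizing the likelihood of the responses $x_n$ conditionally
on the design $d_n$, that is, the design can be ignored. So, if $\xi' = (\alpha', \beta', \mu, \sigma)$ is
the vector of all item and population parameters, the log-likelihood to be maximized can be
written as

$$\ln L(\xi; X, D) = \sum_n \ln \Pr(x_n \mid d_n; \xi), \qquad (2)$$

where $X$ stands for the data matrix and $D$ stands for the design matrix.

To derive the MML estimation equations, it proves convenient to introduce the
vector of derivatives

$$b_n(\xi) = \frac{\partial}{\partial \xi} \ln \Pr(x_n, \theta_n \mid d_n; \xi)
  = \frac{\partial}{\partial \xi} \left[ \ln \Pr(x_n \mid \theta_n, d_n; \alpha, \beta, \gamma) + \ln g(\theta_n; \mu, \sigma) \right], \qquad (3)$$

with

$$\Pr(x_n \mid \theta_n, d_n; \alpha, \beta, \gamma)
  = \prod_i \phi_i(\theta_n)^{d_{ni} x_{ni}} \left( 1 - \phi_i(\theta_n) \right)^{d_{ni}(1 - x_{ni})}. \qquad (4)$$

Glas (1992, 1997a, 1997b) adopts an identity due to Louis (1982) to write the first order
derivatives of (2) with respect to $\xi$ as

$$h(\xi) = \frac{\partial}{\partial \xi} \ln L(\xi; X, D) = \sum_n E\left( b_n(\xi) \mid x_n, d_n, \xi \right). \qquad (5)$$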



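This identity greatly simplifies the derivation of the likelihood equations. For instance,
using the short-hand notation $\psi_{ni} = \psi_i(\theta_n)$ and $\phi_{ni} = \phi_i(\theta_n)$, and noting that
$1 - \phi_{ni} = (1 - \gamma_i)(1 - \psi_{ni})$, from (3) and (4) it can be easily verified that

$$b_n(\alpha_i) = d_{ni}\,\theta_n\,\frac{(x_{ni} - \phi_{ni})\,\psi_{ni}}{\phi_{ni}} \qquad (6)$$

and

$$b_n(\beta_i) = -\,d_{ni}\,\frac{(x_{ni} - \phi_{ni})\,\psi_{ni}}{\phi_{ni}}. \qquad (7)$$

The likelihood equations for the item parameters are found upon inserting these
expressions into (5) and equating these expressions to zero. To derive the likelihood
equations for the population parameters, using (3) results in

$$b_n(\mu) = (\theta_n - \mu)\,\sigma^{-2} \qquad (8)$$

and

$$b_n(\sigma) = -\sigma^{-1} + (\theta_n - \mu)^2\,\sigma^{-3}. \qquad (9)$$

The likelihood equations are again found by inserting these expressions in (5) and equating
these expressions to zero.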






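For computing estimation errors, and the LM, Wald and CUSUM statistics, the second
order derivatives of the log-likelihood function are also needed. As with the derivation
of the estimation equations, the theory by Louis (1982) can also be used for the derivation
of the matrix of second order derivatives. Using Glas (1992), it follows that the
observed information matrix, which is the opposite of the matrix of second order
derivatives, that is,

$$H(\xi, \xi) = -\frac{\partial^2 \ln L(\xi; X, D)}{\partial \xi\, \partial \xi'}, \qquad (10)$$

evaluated using MML estimates, is given by

$$H(\xi, \xi) = -\sum_n \left[ E\left( B_n(\xi, \xi) \mid x_n, d_n, \xi \right)
  + E\left( b_n(\xi)\, b_n(\xi)' \mid x_n, d_n, \xi \right)
  - E\left( b_n(\xi) \mid x_n, d_n, \xi \right) E\left( b_n(\xi) \mid x_n, d_n, \xi \right)' \right], \qquad (11)$$

where

$$B_n(\xi, \xi) = \frac{\partial^2 \ln \Pr(x_n, \theta_n \mid d_n; \xi)}{\partial \xi\, \partial \xi'}. \qquad (12)$$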



Unfortunately, for the 3-PL model, the exact expressions for the second order derivatives
become prohibitively complicated. However, Mislevy (1986) points out that the observed
information matrix can be approximated as

$$H(\xi, \xi) \approx \sum_n E\left( b_n(\xi) \mid x_n, d_n, \xi \right) E\left( b_n(\xi) \mid x_n, d_n, \xi \right)'. \qquad (13)$$

Simulation studies by Glas (1997b) in the framework of the 2-PL model and the nominal
response model (Bock, 1972) show that this approximation is quite good, in the sense that
statistics based on this approximation attain their theoretical distribution. In the sequel, it
will become apparent that this also holds for the 3-PL model.
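The posterior expectations in (5) and the cross-product approximation (13) can be sketched numerically as follows. This is only a minimal sketch under stated assumptions: ability is taken to be standard normal ($\mu = 0$, $\sigma = 1$), the integral over $\theta$ is replaced by a sum over an equally spaced grid, only the derivatives with respect to $\alpha_i$ and $\beta_i$ are included, and the function names are hypothetical.

```python
import math

def three_pl(theta, a, b, c):
    """Return (psi, phi) of equation (1) for one item."""
    psi = 1.0 / (1.0 + math.exp(-(a * theta - b)))
    return psi, c + (1.0 - c) * psi

def expected_scores(x, d, alpha, beta, gamma, n_points=41, bound=4.0):
    """E(b_n | x_n, d_n) for the alpha_i and beta_i of one respondent.

    Assumes theta ~ N(0, 1), integrated out on an equally spaced grid.
    Returns a list of length 2K: first the alpha part, then the beta part.
    """
    K = len(alpha)
    grid = [-bound + 2.0 * bound * q / (n_points - 1) for q in range(n_points)]
    post, terms = [], []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)          # unnormalised N(0, 1) density
        b_alpha, b_beta = [0.0] * K, [0.0] * K
        for i in range(K):
            if not d[i]:
                continue                            # item not administered
            psi, phi = three_pl(theta, alpha[i], beta[i], gamma[i])
            w *= phi if x[i] == 1 else 1.0 - phi    # likelihood term of eq. (4)
            r = (x[i] - phi) * psi / phi
            b_alpha[i] = theta * r                  # eq. (6)
            b_beta[i] = -r                          # eq. (7)
        post.append(w)
        terms.append(b_alpha + b_beta)
    total = sum(post)
    return [sum(p * t[j] for p, t in zip(post, terms)) / total
            for j in range(2 * K)]

def cross_product_information(score_vectors):
    """Approximation (13): sum over respondents of E(b_n) E(b_n)'."""
    m = len(score_vectors[0])
    H = [[0.0] * m for _ in range(m)]
    for s in score_vectors:
        for j in range(m):
            for k in range(m):
                H[j][k] += s[j] * s[k]
    return H
```

Summing the vectors returned by `expected_scores` over respondents gives the gradient (5) for the item parameters, and passing the same vectors to `cross_product_information` gives the matrix in (13).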

Lagrange multiplier tests 

Earlier applications of LM tests to the framework of IRT have been described by
Glas and Verhelst (1995) and Glas (1997a, 1997b). The principle of the LM test
(Aitchison & Silvey, 1958), and the equivalent efficient-score test (Rao, 1948) can be
summarized as follows. Consider a null-hypothesis about a model with parameters $\phi_0$.
This model is a special case of a general model with parameters $\phi$. In the present case the
special model is derived from the general model by fixing one or more parameters to
known constants. Let $\phi_0$ be partitioned as $\phi_0' = (\phi_{01}', \phi_{02}') = (\phi_{01}', c')$, where $c$ is the
vector of the postulated constants and $\phi_{01}$ is the vector of free parameters of the special
model. Let $h(\phi)$ be the partial derivatives of the log-likelihood of the general model, so
$h(\phi) = (\partial / \partial \phi) \ln L(\phi)$. This vector of partial derivatives gauges the change of the
log-likelihood as a function of local changes in $\phi$. Let $H(\phi, \phi)$ be defined as
$-(\partial^2 / \partial \phi\, \partial \phi') \ln L(\phi)$. Then the LM statistic is given by

$$LM = h(\phi_0)'\, H(\phi_0, \phi_0)^{-1}\, h(\phi_0). \qquad (14)$$

If (14) is evaluated using the ML estimate of $\phi_{01}$ and the postulated values of $c$, it has an
asymptotic $\chi^2$-distribution with degrees of freedom equal to the number of parameters
fixed (Aitchison & Silvey, 1958).

An important computational aspect of the procedure is that at the point of the ML
estimates $\hat\phi_{01}$ the free parameters have a partial derivative equal to zero. Therefore, (14)
can be computed as

$$LM(c) = h(c)'\, W^{-1}\, h(c) \qquad (15)$$

with

$$W = H_{22}(c, c) - H_{21}(c, \hat\phi_{01})\, H_{11}(\hat\phi_{01}, \hat\phi_{01})^{-1}\, H_{12}(\hat\phi_{01}, c), \qquad (16)$$

where the partitioning of $H(\phi_0, \phi_0)$ into $H_{22}(c, c)$, $H_{21}(c, \hat\phi_{01})$, $H_{11}(\hat\phi_{01}, \hat\phi_{01})$, and
$H_{12}(\hat\phi_{01}, c)$ is according to the partition $\phi_0' = (\phi_{01}', \phi_{02}') = (\phi_{01}', c')$.

Notice that $H_{11}(\hat\phi_{01}, \hat\phi_{01})$ also plays a role in the Newton-Raphson procedure for
solving the estimation equations and in computation of the observed information matrix.
So its inverse will usually be available at the end of the estimation procedure. Further, if
the validity of the model of the null-hypothesis is tested against various alternative
models, the computational task is relieved because the inverse of $H_{11}(\hat\phi_{01}, \hat\phi_{01})$ is already
available and the order of $W$ is equal to the number of parameters fixed, which must be
small to keep the interpretation of the outcome tractable.

The interpretation of the outcome of the test is supported by observing that the
value of (15) depends on the magnitude of $h(c)$, that is, on the first order derivatives with
respect to the parameters evaluated at $c$. If the absolute values of these derivatives are
large, the fixed parameters are bound to change once they are set free, and the test is
significant, that is, the special model is rejected. If the absolute values of these derivatives
are small, the fixed parameters will probably show little change should they be set free.
In that case the values at which these parameters are fixed in the special model are adequate
and the test is not significant, that is, the special model is not rejected.
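The computation in (15) and (16) can be sketched as follows. This is a minimal sketch, not the author's implementation: the function names are hypothetical, the blocks of $H$ are passed as nested lists, $H_{21}$ is taken to be the transpose of $H_{12}$ (which assumes a symmetric $H$), and a small Gauss-Jordan routine stands in for whatever linear algebra library would be used in practice.

```python
def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting.

    A is p x p and B is p x q, both given as nested lists.
    """
    p, q = len(A), len(B[0])
    M = [list(map(float, A[r])) + list(map(float, B[r])) for r in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [[M[r][p + j] / M[r][r] for j in range(q)] for r in range(p)]

def lm_statistic(h_c, H11, H12, H22):
    """LM(c) = h(c)' W^{-1} h(c) with W = H22 - H21 H11^{-1} H12, eqs (15)-(16).

    h_c : first order derivatives for the q fixed parameters, evaluated at c
    H11 : p x p block for the free parameters
    H12 : p x q block (H21 is assumed to be its transpose)
    H22 : q x q block for the fixed parameters
    """
    p, q = len(H11), len(h_c)
    Z = solve(H11, H12)                                   # H11^{-1} H12
    W = [[H22[j][k] - sum(H12[r][j] * Z[r][k] for r in range(p))
          for k in range(q)] for j in range(q)]
    u = solve(W, [[v] for v in h_c])                      # W^{-1} h(c)
    return sum(h_c[j] * u[j][0] for j in range(q))
```

The returned value would be referred to a $\chi^2$-distribution with as many degrees of freedom as there are fixed parameters, that is, the length of `h_c`.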

Lagrange Multiplier Statistics for Quality Control 



In the Preliminaries, it was already noted that simultaneous ML estimates of
all item parameters in the 3-PL model are hard to obtain (see, for instance, Swaminathan
and Gifford, 1986). Therefore, in the present paper it will be assumed that the guessing
parameter $\gamma_i$ is fixed to some plausible constant, say, to the guessing probability. In this
section, it will be shown how an LM statistic can be used for testing whether this fixed
guessing parameter is appropriate and remains appropriate when confronted with the
adaptive testing data.

Consider $G$ groups labeled $g = 1, \ldots, G$ and let $y_{ng} = 1$ if person $n$ belongs to
group $g$, and $y_{ng} = 0$ otherwise. In this paper, the first group partakes in the pre-testing
stage, and the following groups partake in the on-line stage. Given this partition, several
hypotheses can be tested. For instance, Glas (1997a) suggests evaluating DIF by testing the
hypothesis that item parameters are constant over groups, i.e., testing the hypothesis that
$\alpha_{ig} = \alpha_i$ and $\beta_{ig} = \beta_i$, for $g = 1, \ldots, G$. This can, of course, also be applied in an
adaptive testing situation for monitoring parameter drift. However, in the present paper, a
test for the hypothesis that $\gamma_{ig} = \gamma_i$, for $g = 1, \ldots, G$, will be given as an example of
applying the LM approach to quality control of adaptive testing. The LM statistic for
testing this hypothesis is based on the first order derivatives with respect to $\gamma_{ig}$. For
using (3), the first order derivatives of the logarithm of (4) with respect to $\gamma_{ig}$ need to be
computed. It is easily verified that

$$b_n(\gamma_{ig}) = y_{ng}\, d_{ni}\, \frac{(x_{ni} - \phi_{ni})(1 - \psi_{ni})}{\phi_{ni}\,(1 - \phi_{ni})}. \qquad (17)$$

Let $\Gamma_i$ be a vector of the elements $\gamma_{ig}$, $g = 1, \ldots, G$. A test for the null-hypothesis
$\gamma_{ig} = \gamma_i$ can be based on

$$LM(\Gamma_i) = h(\Gamma_i)'\, W^{-1}\, h(\Gamma_i) \qquad (18)$$

with

$$W = H_{22}(\Gamma_i, \Gamma_i) - H_{21}(\Gamma_i, \hat\xi)\, H_{11}(\hat\xi, \hat\xi)^{-1}\, H_{12}(\hat\xi, \Gamma_i), \qquad (19)$$

where $\xi$ is the vector of the parameters of the null-model. Therefore, $H_{11}(\hat\xi, \hat\xi)$ is the
matrix of second order derivatives with respect to these parameters, that is, it is equivalent
to the matrix defined by (10). If $h(\Gamma_i)$ and $W$ are evaluated using MML estimates of the
null-model, i.e. the estimates $\hat\xi$, the $LM(\Gamma_i)$ statistic has an asymptotic $\chi^2$-distribution
with $G$ degrees of freedom.



A Wald test and a CUSUM chart for Quality Control 



The CUSUM chart is an instrument of statistical quality control used for detecting
small changes in product features during the production process. The CUSUM chart is
used in a sequential statistical test, where the null-hypothesis of no change is never
accepted (Veerkamp, 1996). In the present case, the alternative hypothesis is that the item
is becoming easier and is losing its discriminating power. Therefore, the null-hypothesis
is $\alpha_{ig} - \alpha_{i1} \ge 0$ and $\beta_{ig} - \beta_{i1} \ge 0$, for groups of respondents labeled
$g = 1, \ldots, G$. As above, the first group partakes in the pre-testing stage, and the following
groups are groups taking an adaptive test.
Before turning to the one-sided hypothesis $\alpha_{ig} - \alpha_{i1} \ge 0$ and $\beta_{ig} - \beta_{i1} \ge 0$, first
consider the two-sided null-hypothesis that $\alpha_{ig} - \alpha_{i1} = 0$ and $\beta_{ig} - \beta_{i1} = 0$. Let $d_{ig}$ be a
vector defined by $d_{ig} = (\alpha_{ig} - \alpha_{i1},\, \beta_{ig} - \beta_{i1})'$. This two-sided hypothesis can be
evaluated with the Wald statistic

$$d_{ig}'\, W_{ig}^{-1}\, d_{ig}, \qquad (20)$$

where $W_{ig}$ is the covariance matrix of $d_{ig}$. Since the statistic is computed using
independent estimates of the item parameters in two groups, it holds that
$W_{ig} = \Sigma_{ig} + \Sigma_{i1}$, where $\Sigma_{ig}$ and $\Sigma_{i1}$ can be approximated using the relevant elements of
the inverse of the information matrix (13), computed with the MML estimates obtained in
group $g$ and group 1, respectively. The statistic defined in (20) has an asymptotic
$\chi^2$-distribution with two degrees of freedom. However, the interest is in a one-sided test,
so the signs of the elements of $d_{ig}$ are also needed. Since (20) is a quadratic form, its
signed square root is of interest. Further, it may be interesting to test the hypothesis
iteratively. Therefore, a one-sided cumulative sum chart will be based on the quantity

$$S_i(g) = \max\left\{ S_i(g-1)
  + \frac{\alpha_{i1} - \alpha_{ig}}{Se(\alpha_{ig} - \alpha_{i1})}
  + \frac{\beta_{i1} - \beta_{ig}}{Se(\beta_{ig} - \beta_{i1} \mid \alpha_{ig} - \alpha_{i1})}
  - k_i,\; 0 \right\}, \qquad (21)$$

where $Se(\alpha_{ig} - \alpha_{i1}) = \sigma_\alpha$ and
$Se(\beta_{ig} - \beta_{i1} \mid \alpha_{ig} - \alpha_{i1}) = \sqrt{\sigma_\beta^2 - \sigma_{\alpha\beta}^2 / \sigma_\alpha^2}$, with
$\sigma_\alpha^2$, $\sigma_\beta^2$ and $\sigma_{\alpha\beta}$ the appropriate elements of the covariance matrix $W_{ig}$ which is
also used in (20). Further, $k_i$ is a reference value. The CUSUM chart starts with

$$S_i(0) = 0, \qquad (22)$$

and the null-hypothesis is rejected as soon as

$$S_i(g) > h_i, \qquad (23)$$



where $h_i$ is some constant threshold value. The choice of the constants $k_i$ and $h_i$
determines the power of the procedure. In the case of the Rasch model, where the
null-hypothesis is $\beta_{ig} - \beta_{i1} \ge 0$, and the term involving the discrimination indices is
lacking from (21), Veerkamp (1996) successfully uses $k_i = 1/2$ and $h_i = 5$. This choice was
motivated by the consideration that this setup has good power against the alternative
hypothesis of a normalized shift in item difficulty of approximately one standard
deviation. In the present case one extra normalized decision variable is employed, i.e., the
variable involving the discrimination indices. To have power against a shift of one
standard deviation of both normalized decision variables in the direction of the alternative
hypothesis, a value $k_i = 1$ will be tried out below. The value $h_i = 5$ will not be changed.
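The Wald statistic (20) and the CUSUM recursion (21) to (23) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the covariance matrix $W_{ig}$ is assumed to be supplied as a 2 x 2 nested list ordered as ($\alpha$, $\beta$), and on-line batches are numbered from $g = 2$ because group 1 is the pre-test group.

```python
import math

def wald_statistic(d, W):
    """d' W^{-1} d for d = (alpha_ig - alpha_i1, beta_ig - beta_i1), eq. (20)."""
    (saa, sab), (_, sbb) = W
    det = saa * sbb - sab * sab
    # explicit inverse of a symmetric 2 x 2 matrix
    return (sbb * d[0] ** 2 - 2.0 * sab * d[0] * d[1] + saa * d[1] ** 2) / det

def cusum_step(s_prev, est_1, est_g, W, k=1.0):
    """One update of the one-sided CUSUM statistic, eq. (21).

    est_1, est_g : (alpha, beta) estimates from the pre-test batch and batch g
    W            : covariance matrix of the differences, as used in eq. (20)
    """
    (saa, sab), (_, sbb) = W
    se_alpha = math.sqrt(saa)
    se_beta_given_alpha = math.sqrt(sbb - sab * sab / saa)
    z = ((est_1[0] - est_g[0]) / se_alpha
         + (est_1[1] - est_g[1]) / se_beta_given_alpha)
    return max(s_prev + z - k, 0.0)

def cusum_chart(est_1, batches, k=1.0, h=5.0):
    """Return the first batch number g with S_i(g) > h, or None if none is found.

    batches : list of (est_g, W) pairs for the on-line batches g = 2, 3, ...
    """
    s = 0.0                                   # eq. (22)
    for g, (est_g, W) in enumerate(batches, start=2):
        s = cusum_step(s, est_1, est_g, W, k)
        if s > h:                             # eq. (23)
            return g
    return None
```

With the default values `k=1.0` and `h=5.0`, the chart follows the settings tried out in the text. A drift that lowers both $\alpha_i$ and $\beta_i$ makes each increment positive, so the statistic accumulates until the threshold is crossed.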

Examples 

In this section, the power of the procedures suggested above will be investigated
using a number of simulation studies. Since all statistics involve an estimate of the
standard error of the parameter estimates, and this standard error is approximated using
(13), the precision of this approximation will be studied first by assessing the power
(that is, the rejection rate) of the statistics under the null-model. Then the power of the
tests will be studied under various model violations.

For all simulations reported below, the ability parameters $\theta$ were drawn from a
standard normal distribution. The item difficulties $\beta_i$ were uniformly distributed on
[-1.0, 1.0], the discrimination indices $\alpha_i$ were drawn from a log-normal distribution
with a zero mean and a standard deviation equal to 0.10, and the guessing parameter $\gamma_i$
was generally fixed at 0.20. In the on-line phase, item selection was done using the
maximum information principle. The ability parameter $\theta$ was estimated by its expected
a posteriori (EAP) value; the initial prior was standard normal.
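The simulation setup described above can be sketched as follows. This is an illustrative sketch only and not the program used for the reported studies. It assumes that the log-normal mean and standard deviation refer to the log scale, that EAP estimates are computed on an equally spaced grid, and that the first item is selected at the prior mean $\theta = 0$. All function names are hypothetical.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """Return (psi, phi) of equation (1)."""
    psi = 1.0 / (1.0 + math.exp(-(a * theta - b)))
    return psi, c + (1.0 - c) * psi

def draw_item_bank(K, rng, gamma=0.20):
    """Draw (alpha, beta, gamma) for K items as in the simulation design."""
    return [(rng.lognormvariate(0.0, 0.10), rng.uniform(-1.0, 1.0), gamma)
            for _ in range(K)]

def item_information(theta, a, b, c):
    """Fisher information of a 3-PL item at theta."""
    psi, phi = p_3pl(theta, a, b, c)
    return a * a * psi * psi * (1.0 - phi) / phi

def eap(responses, bank, n_points=41, bound=4.0):
    """EAP estimate of theta under a standard normal prior, on a grid."""
    num = den = 0.0
    for q in range(n_points):
        theta = -bound + 2.0 * bound * q / (n_points - 1)
        w = math.exp(-0.5 * theta * theta)
        for i, x in responses:
            phi = p_3pl(theta, *bank[i])[1]
            w *= phi if x == 1 else 1.0 - phi
        num += theta * w
        den += w
    return num / den

def adaptive_test(theta_true, bank, L, rng):
    """Administer L items by maximum information at the current EAP estimate."""
    responses, theta_hat, used = [], 0.0, set()
    for _ in range(L):
        i = max((j for j in range(len(bank)) if j not in used),
                key=lambda j: item_information(theta_hat, *bank[j]))
        used.add(i)
        x = 1 if rng.random() < p_3pl(theta_true, *bank[i])[1] else 0
        responses.append((i, x))
        theta_hat = eap(responses, bank)
    return responses, theta_hat

rng = random.Random(1)                      # fixed seed, for reproducibility only
bank = draw_item_bank(50, rng)
responses, theta_hat = adaptive_test(rng.gauss(0.0, 1.0), bank, 20, rng)
```

Each call to `adaptive_test` yields the administered item indices with the simulated responses, which is the kind of sparse, ability-dependent design the on-line calibration has to work with.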

The results of eight simulation studies with respect to the power of the statistics
under the null-model are shown in Table 1. The number of items $K$ in the item bank was
fixed at 50 for the first four studies and at 100 for the next four. Both in the pre-test
phase and the on-line phase, test lengths $L$ of 20 and 40 were chosen; the exact setup is
shown in the first two columns of Table 1. Finally, in the third column it can be seen that
the number of respondents per phase was fixed at 500 and 1000 respondents. So summed over
the pre-test and on-line phase, the sample sizes were 1000 and 2000 respondents,
respectively. For the pre-test phase, a spiralled test administration design was used. For
instance, for the $K = 50$ studies, for the pre-test phase, five subgroups were used: the
first subgroup was administered the items 1 to 20, the second the items 11 to 30, the third
the items 21 to 40, the fourth the items 31 to 50, and the last group received the items 1 to
10 and 41 to 50. In this manner, all items drew the same number of responses in the pre-test
phase. For the $K = 100$ studies, for the pre-test phase four subgroups administered 50 items
were made, so here the design was 1-50, 26-75, 51-100, and 1-25 together with 76-100. For
each study, 100 replications were run.

Table 1

Power of LM and Wald test under the null-model
(100 replications)

                      percentage significant at 10%
   K     L    N_g      LM test     Wald test
  50    20    500          8            9
             1000         10           10
        40    500          9           10
             1000         11            8
 100    20    500         12           10
             1000          8            9
        40    500         10           12
             1000         10           10

K    size item pool
L    test length
N_g  number of persons in calibration and adaptive testing batches
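The spiralled pre-test design described above can be generated with a short helper. The function name `spiralled_design` and its arguments are hypothetical; the block sizes and offsets are the ones stated in the text.

```python
def spiralled_design(K, block, step):
    """Item numbers (1-based) per pre-test subgroup; blocks wrap around the bank.

    K     : number of items in the bank
    block : number of items administered to each subgroup
    step  : offset between the first items of consecutive subgroups
    """
    return [[(start + j) % K + 1 for j in range(block)]
            for start in range(0, K, step)]

# K = 50: five subgroups of 20 items, each shifted by 10 items
for group in spiralled_design(50, 20, 10):
    print(group[0], "...", group[-1])

# K = 100: four subgroups of 50 items, each shifted by 25 items
design_100 = spiralled_design(100, 50, 25)
```

Because each block is twice as long as the offset, every item appears in exactly two subgroups, which is why all items draw the same number of responses in the pre-test phase.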

The results of the study are shown in the last two columns of Table 1. These
columns contain the percentages of LM and Wald tests that were significant at the 10%
level. It can be seen that the rejection rate of the tests conforms to its theoretical value
of 10%: the observed percentages range from 8% to 12%. Therefore, it can be concluded
that the approximations of the standard errors were quite close.
Table 2

Detection of aberrant items: changes in $\gamma_i$
(per row: 100 replications for LM/Wald and 20 replications for CUSUM)

from $\gamma_i$ = .00 to $\gamma_i$ = .25

                     % significant at 10%      % detected by CUSUM after iteration
   K     L    N_g     LM test   Wald test        2     3     4     6     8    10
  50    20    500        95        69           72    77    88   100   100   100
             1000       100        70           85    90   100   100   100   100
        40    500       100       100           77    83    99   100   100   100
             1000       100       100           93    98   100   100   100   100
 100    20    500        92        93           69    75    92   100   100   100
             1000        98        92           81    95   100   100   100   100
        40    500       100       100           73    87   100   100   100   100
             1000       100       100           88    99   100   100   100   100

from $\gamma_i$ = .20 to $\gamma_i$ = .30

                     % significant at 10%      % detected by CUSUM after iteration
   K     L    N_g     LM test   Wald test        2     3     4     6     8    10
  50    20    500        10        25            2     3     4    12    33    45
             1000        40        60            2     4     4    35    58    66
        40    500        31        22            3     3     4    22    44    65
             1000        55        73            4     6     7    45    56    78
 100    20    500        18        21            1     2    10    11    45    50
             1000        58        47            4     5     5    13    54    67
        40    500        42        44            3     4     7    32    45    75
             1000        49        77            2     6     7    22    50    76

from $\gamma_i$ = .20 to $\gamma_i$ = .40

                     % significant at 10%      % detected by CUSUM after iteration
   K     L    N_g     LM test   Wald test        2     3     4     6     8    10
  50    20    500        50        44           10    15    19    40    66    70
             1000        90        60           12    18    22    50    81    82
        40    500        89        97           18    26    33    76    89   100
             1000       100        99           17    24    38    73   100   100
 100    20    500        52        44            9    12    18    34    75    86
             1000        88        73           11    22    25    68    79   100
        40    500        90        82           19    24    31    57    83   100
             1000       100       100           18    29    30    77   100   100

A second series of simulations focused on the power in the case that the on-line
responses were generated using a value for the guessing parameter $\gamma_i$ that was different
from the value of the pre-test phase. The results are shown in Table 2. The first panel of
the table pertains to a situation where, for the items 5, 10, 15, etc., $\gamma_i$ changes from 0.00
in the pre-test phase to 0.25 in the on-line phase. So 20% of the items do not fit the
null-model of the pre-test phase. In the fourth and fifth column, the rejection rate of
aberrant items using a 10% significance level is shown for the LM and Wald test,
respectively. The number of replications was 100. It can be seen that the power of both
tests is quite large. Then, for 20 replications, 9 more batches of $N_g$ respondents were
generated and for each new batch, the CUSUM statistic defined by (21) was computed. In the
last six columns the percentage of the detected aberrant items is shown. Non-aberrant items
were detected at chance level, in this case 5%. It can be seen that 88% to 100% of the
aberrant items are detected after 4 iterations and all of them after 6 iterations, which
can be considered quite good.

The positive picture of the power of the LM, Wald and CUSUM procedures changes
dramatically if $\gamma_i$ changes from 0.20 in the pre-test phase to 0.30 in the on-line
phase. From the second panel of Table 2, it can be seen that in this case the power of the
LM and Wald test is quite low, while even after 10 iterations the CUSUM procedure has
only detected between 45% and 78% of the aberrant items. In the last panel of Table 2,
$\gamma_i$ changes from 0.20 to 0.40, and the power becomes better, although for the $L = 20$
studies, the power is still quite low.

Note that in the above simulations, only the LM test is strictly aimed at the
alternative that $\gamma_i$ has changed. However, the estimates of the three parameters of the
3-PL model are highly correlated. This implies that changes in parameters are often
confounded and it is very difficult to identify the actual parameter that is changing. For
instance, if an item becomes known, this can be translated into an increase of $\gamma_i$, that
is, an increase of the probability of a correct response unassociated with $\theta$, into a loss
of discriminating power, and into a lowering of item difficulty. As a consequence, a test
that should be sensitive to changes in $\gamma_i$ may also have power against changes in
$\alpha_i$ and $\beta_i$. The latter case was investigated using the same simulation setup as above.
The results are displayed in Table 3. The first panel pertains to a change of -0.50 in the
difficulty of the items 5, 10, 15, 20, etc.; the second panel pertains to a change of -1.00
in the difficulty of these items. It can be seen that all tests are indeed sensitive to these
changes; the power for the change of -1.00 in particular is very high.
Table 3

Detection of aberrant items: changes in $\beta_i$
(per row: 100 replications for LM/Wald and 20 replications for CUSUM)

change -0.50

                     % significant at 10%      % detected by CUSUM after iteration
   K     L    N_g     LM test   Wald test        2     3     4     6     8    10
  50    20    500        25        24           12    17    33    65    96   100
             1000        27        28           10    17    28    80    91   100
        40    500        24        22           12    26    38    88    99   100
             1000        30        20           15    22    41    83   100   100
 100    20    500        23        21           12    18    31    94    95   100
             1000        44        33            5    20    32    78    89   100
        40    500        50        42           19    24    54    87   100   100
             1000        53        44           17    23    55    87   100   100

change -1.00

                     % significant at 10%      % detected by CUSUM after iteration
   K     L    N_g     LM test   Wald test        2     3     4     6     8    10
  50    20    500        99        89           80    99   100   100   100   100
             1000        90        90           85    90   100   100   100   100
        40    500        89        96           87    83   100   100   100   100
             1000        94        96           89    98   100   100   100   100
 100    20    500        96        98           87    95    98   100   100   100
             1000        99        92           83    95   100   100   100   100
        40    500        89        94           93    97   100   100   100   100
             1000        99        99           98    99   100   100   100   100


Discussion 

In this paper, it was explored how to evaluate whether the adaptive testing data 
used for on-line calibration sufficiently fit the item response model used. Three approaches 
were studied, one based on a Lagrange multiplier (LM) statistic, the others on a Wald and 
a cumulative sum (CUSUM) statistic, respectively. The theoretical advantage of the latter 
procedure is that it is based on a directional hypothesis and can be used iteratively. The 
power of the tests was evaluated with a number of simulation studies. It was found that 
the power of the procedures ranged from rather moderate for a change from $\gamma_i = 0.20$ to
$\gamma_i = 0.30$, to good for a change from $\gamma_i = 0.00$ to $\gamma_i = 0.25$. Further, it was found that

the tests are equally sensitive to changes in item difficulty and the guessing parameter. So 
the bottom line here is that all these statistics detect that something has happened to the 
parameters, but it will be very difficult to attribute misfit to specific parameters. 